Peer-to-Peer lending Domain

https://www.investopedia.com/terms/p/peer-to-peer-lending.asp

Tasks

  • predict

About Personal loan

A personal loan allows you to borrow money from a lender for almost any purpose, typically with a fixed term, a fixed interest rate, and a regular monthly payment schedule. Collateral usually is not required.


Lending Club

LendingClub is a peer-to-peer lending company headquartered in San Francisco California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC) and to offer loan trading on a secondary market. At its height LendingClub was the world’s largest peer-to-peer lending platform. []

_images/lendingclub.png

Fig. 1 lending club website

Business Understanding

The LendingClub company specialises in lending various types of loans to urban customers. When the company receives a loan application, the company has to make a decision for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:

  • If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company

  • If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company

The data given contains the information about past loan applicants and whether they ‘defaulted’ or not. The aim is to identify patterns which indicate if a person is likely to default, which may be used for takin actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc.

When a person applies for a loan, there are two types of decisions that could be taken by the company:

  1. Loan accepted: If the company approves the loan, there are 3 possible scenarios described below:

    • Fully paid: Applicant has fully paid the loan (the principal and the interest rate)

    • Current: Applicant is in the process of paying the instalments, i.e. the tenure of the loan is not yet completed. These candidates are not labelled as ‘defaulted’.

    • Charged-off: Applicant has not paid the instalments in due time for a long period of time, i.e. he/she has defaulted on the loan

  2. Loan rejected: The company had rejected the loan (because the candidate does not meet their requirements etc.). Since the loan was rejected, there is no transactional history of those applicants with the company and so this data is not available with the company (and thus in this dataset)

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import hvplot.pandas
from pathlib import Path

main_path = Path().absolute().parent

Exploratory Data Analysis

Data Description

LoanStatNew

Description

loan_amnt

The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

term

The number of payments on the loan. Values are in months and can be either 36 or 60.

loan_status

Current status of the loan.

int_rate

Interest Rate on the loan.

installment

The monthly payment owed by the borrower if the loan originates.

grade

LC assigned loan grade.

sub_grade

LC assigned loan subgrade.

emp_title

The job title supplied by the Borrower when applying for the loan.

emp_length

Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.

home_ownership

The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER

annual_inc

The self-reported annual income provided by the borrower during registration.

verification_status

Indicates if income was verified by LC, not verified, or if the income source was verified.

issue_d

The month which the loan was funded.

purpose

A category provided by the borrower for the loan request.

title

The loan title provided by the borrower.

dti

A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrowe’s self-reported monthly income.

earliest_cr_line

The month the borrower’s earliest reported credit line was opened.

open_acc

The number of open credit lines in the borrower’s credit file.

pub_rec

Number of derogatory public records.

pub_rec_bankruptcies

Number of public record bankruptcies.

revol_bal

Total credit revolving balance.

revol_util

“Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.

total_acc

The total number of credit lines currently in the borrower’s credit file.

initial_list_status

The initial listing status of the loan. Possible values are – W, F

application_type

Indicates whether the loan is an individual application or a joint application with two co-borrowers.

mort_acc

Number of mortgage accounts.

addr_state

The state provided by the borrower in the loan application.

data_path = main_path / 'data' / 'p2p' / 'lending_club' / 'processed'
# import data
df = pd.read_csv( data_path / 'accepted.csv')

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1345310 entries, 0 to 1345309
Data columns (total 27 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   loan_amnt             1345310 non-null  float64
 1   term                  1345310 non-null  object 
 2   loan_status           1345310 non-null  object 
 3   int_rate              1345310 non-null  float64
 4   installment           1345310 non-null  float64
 5   grade                 1345310 non-null  object 
 6   sub_grade             1345310 non-null  object 
 7   emp_title             1259525 non-null  object 
 8   emp_length            1266799 non-null  object 
 9   home_ownership        1345310 non-null  object 
 10  annual_inc            1345310 non-null  float64
 11  verification_status   1345310 non-null  object 
 12  issue_d               1345310 non-null  object 
 13  purpose               1345310 non-null  object 
 14  title                 1328651 non-null  object 
 15  dti                   1344936 non-null  float64
 16  earliest_cr_line      1345310 non-null  object 
 17  open_acc              1345310 non-null  float64
 18  pub_rec               1345310 non-null  float64
 19  revol_bal             1345310 non-null  float64
 20  revol_util            1344453 non-null  float64
 21  total_acc             1345310 non-null  float64
 22  initial_list_status   1345310 non-null  object 
 23  application_type      1345310 non-null  object 
 24  mort_acc              1298029 non-null  float64
 25  pub_rec_bankruptcies  1344613 non-null  float64
 26  addr_state            1345310 non-null  object 
dtypes: float64(12), object(15)
memory usage: 277.1+ MB

loan status

Current status of the loan.

df['loan_status'].value_counts().hvplot.bar(
    title="Loan Status Counts", xlabel='Loan Status', ylabel='Count', 
    width=500, height=350, yformatter='%d'
)

loan_amnt, installment & int_rate

  • loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

  • installment: The monthly payment owed by the borrower if the loan originates.

  • int_rate: Interest Rate on the loan.

installment = df.hvplot.hist(
    y='installment', by='loan_status', subplots=False, 
    width=400, height=400, bins=50, alpha=0.4, 
    title="Installment by Loan Status", 
    xlabel='Installment', ylabel='Counts', legend='top',
    yformatter='%d'
)

loan_amnt = df.hvplot.hist(
    y='loan_amnt', by='loan_status', subplots=False, 
    width=400, height=400, bins=30, alpha=0.4, 
    title="Loan Amount by Loan Status", 
    xlabel='Loan Amount', ylabel='Counts', legend='top',
    yformatter='%d'
)

int_rate = df.hvplot.hist(
    y='int_rate', by='loan_status', subplots=False, 
    width=400, height=400, bins=30, alpha=0.4, 
    title='Interest Rate by Loan Status', 
    xlabel='Interest Rate', ylabel='Counts', legend='top',
    yformatter='%d'
)

installment + loan_amnt + int_rate
installment_box = df.hvplot.box(
    y='installment', subplots=True, by='loan_status', width=300, height=400, 
    title='Loan Status by Installment', xlabel='Loan Status', ylabel='Installment', legend=False
)

loan_amnt_box = df.hvplot.box(
    y='loan_amnt', subplots=True, by='loan_status', width=300, height=400, 
    title='Loan Status by Loan Amount', xlabel='Loan Status', ylabel='Loan Amount', legend=False
)

int_rate_box = df.hvplot.box(
    y='int_rate', subplots=True, by='loan_status', width=300, height=400, 
    title='Loan Status by Interest Rate', xlabel='Loan Status', ylabel='Interest Rate', legend=False
)

loan_amnt_box + installment_box + int_rate_box

grade & sub_grade

  • grade: LC assigned loan grade.

  • sub_grade: LC assigned loan subgrade.

fully_paid = df.loc[df['loan_status']=='Fully Paid', 'grade'].value_counts().hvplot.bar() 
charged_off = df.loc[df['loan_status']=='Charged Off', 'grade'].value_counts().hvplot.bar() 

(fully_paid * charged_off).opts(
    title="Loan Status by Grade", xlabel='Grades', ylabel='Count',
    width=500, height=450, legend_cols=2, legend_position='top_right'
)
fully_paid = df.loc[df['loan_status'] == 'Fully Paid', 'sub_grade'].value_counts().hvplot.bar() 
charged_off = df.loc[df['loan_status'] == 'Charged Off', 'sub_grade'].value_counts().hvplot.bar() 

(fully_paid * charged_off).opts(
    title="Loan Status by Grade", xlabel='Grades', ylabel='Count',
    width=500, height=400, legend_cols=2, legend_position='top_right', xrotation=90
)

usually giving the money to grade A - D, but cannot say that the people who has lower grade pay less on their loan.

df.loc[df['grade'].isin(['E', 'F', 'G'])].groupby(['loan_status', 'grade'])['grade'].count()
loan_status  grade
Charged Off  E        36035
             F        14491
             G         4560
Fully Paid   E        57615
             F        17567
             G         4572
Name: grade, dtype: int64

term, home_ownership, & purpose

  • term: The number of payments on the loan. Values are in months and can be either 36 or 60.

  • home_ownership: The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER

  • purpose: A category provided by the borrower for the loan request.

df.groupby(['loan_status'])[['home_ownership']].value_counts().rename('Count').hvplot.table()
home_ownership = df.groupby(['loan_status'])[['home_ownership']].value_counts().rename('Count').hvplot.bar()
home_ownership.opts(
    title="Home Ownership by Loan Status", xlabel='Home Ownership / Loan Status', ylabel='Count',
    width=900, height=450, show_legend=True, yformatter='%d'
)
term = df.groupby(['loan_status'])[['term']].value_counts().rename('Count').hvplot.bar()
term.opts(
    title="Term by Loan Status", xlabel='Term / Loan Status', ylabel='Count',
    width=600, height=450, show_legend=True, yformatter='%d'
)
term = df.groupby(['loan_status'])[['purpose']].value_counts().rename('Count').hvplot.bar()
term.opts(
    title="Purpose by Loan Status", xlabel='Purpose / Loan Status', ylabel='Count',
    width=700, height=450, show_legend=True, yformatter='%d', xrotation=90
)

annual_inc & verification_status

  • annual_inc: The self-reported annual income provided by the borrower during registration.

  • verification_status: Indicates if income was verified by LC, not verified, or if the income source was verified

df.groupby(['loan_status', 'verification_status'])['annual_inc'].describe().hvplot.table(title='Annual Income Table Description By Verification')
(df.groupby(['loan_status'])[['verification_status']].value_counts() / df.groupby(['loan_status'])['verification_status'].count())\
    .rename('percentage').hvplot.table(title='Income Verified Rate')
def is_outlier(x): 
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    upper = np.percentile(x, 75) + (iqr * 1.5)
    lower = np.percentile(x, 25) - (iqr * 1.5)

    return (x > upper) | (x < lower)


annual_inc = df.loc[~df.groupby(['loan_status', 'verification_status'])['annual_inc'].apply(is_outlier), 
    ['loan_status', 'verification_status', 'annual_inc']].hvplot.hist(
    y='annual_inc', by='loan_status', groupby='verification_status', subplots=False, 
    width=900, height=400, bins=40, alpha=0.4, title='Annual Income(1Q~3Q +/- 1.5*IQR) Distsribution by Loan Status', 
    xlabel='Annual Income', ylabel='Counts', legend='top', yformatter='%d', xformatter='%d', dynamic=False
)
annual_inc

emp_title & emp_length

  • emp_title: The job title supplied by the Borrower when applying for the loan.

  • emp_length: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.

check_null = lambda x: x.isnull().sum()
df_emp_null = df.loc[:, ['emp_title', 'emp_length']].apply([check_null, pd.Series.nunique]).rename(index={'<lambda>': 'nnull'}).reset_index().hvplot.table(
    title='Job Title & Employment length: NA values', height=100
)
df_emp_top20 = df['emp_title'].value_counts().reset_index().rename(columns={'index': 'emp_title', 'emp_title': 'count'})[:20].hvplot.table(
    title='Job Title Top 20'
)
df_emp_null
df_emp_top20
df['emp_length'].fillna('unknown', inplace=True)
df['emp_title'].fillna('unknown', inplace=True)
df['emp_title'] = df['emp_title'].str.lower()  # Unify into lower cases
df_emp_top20 = df['emp_title'].value_counts().reset_index().rename(columns={'index': 'emp_title', 'emp_title': 'count'})[:20].hvplot.table(
    title='Job Title Top 20'
)
df_emp_top20
from itertools import product

loan_status_order = ['Charged Off', 'Fully Paid']
emp_length_order = ['unknown', '< 1 year', '1 year', '2 years', '3 years', '4 years', '5 years', '6 years', '7 years', '8 years', '9 years', '10+ years']
emp_length = df.groupby(['loan_status'])[['emp_length']].value_counts().reindex(list(product(*[loan_status_order, emp_length_order]))).rename('Counts').hvplot.barh(stacked=True, legend='right')
emp_length.opts(
    title='Loan Status by Employment Length in years', height=400, width=900, xlabel='Counts', ylabel='Employment Length in years', xformatter='%d'
)

issue_d & earliest_cr_line

  • issue_d: The month which the loan was funded.

  • earliest_cr_line: The month the borrower’s earliest reported credit line was opened.

df['issue_d'] = pd.to_datetime(df['issue_d'])
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'])

fully_paid = df.loc[df['loan_status']=='Fully Paid', 'issue_d'].hvplot.hist(bins=35) 
charged_off = df.loc[df['loan_status']=='Charged Off', 'issue_d'].hvplot.hist(bins=35)

# fully_paid * charged_off
loan_issue_date = (fully_paid * charged_off).opts(
    title="Loan Status by Loan Issue Date", xlabel='Loan Issue Date', ylabel='Count',
    width=450, height=350, legend_cols=2, legend_position='top_right'
).opts(xrotation=45, yformatter='%d')

fully_paid = df.loc[df['loan_status']=='Fully Paid', 'earliest_cr_line'].hvplot.hist(bins=35) 
charged_off = df.loc[df['loan_status']=='Charged Off', 'earliest_cr_line'].hvplot.hist(bins=35)

earliest_cr_line = (fully_paid * charged_off).opts(
    title="Loan Status by earliest_cr_line", xlabel='earliest_cr_line', ylabel='Count',
    width=450, height=350, legend_cols=2, legend_position='top_right'
).opts(xrotation=45, yformatter='%d')

loan_issue_date + earliest_cr_line

title

print(df['title'].isnull().sum())
16659
df['title'] = df['title'].str.lower()
df['title'].value_counts()[:10]
debt consolidation         682107
credit card refinancing    250295
home improvement            77604
other                       66862
major purchase              24334
medical expenses            13681
business                    11843
car financing               11245
vacation                     8233
moving and relocation        7952
Name: title, dtype: int64

dti, open_acc, pub_rec, pub_rec_bankruptcies

  • dti: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrowe’s self-reported monthly income.

  • open_acc: The number of open credit lines in the borrower’s credit file.

  • pub_rec: Number of derogatory public records.

  • pub_rec_bankruptcies: Number of public record bankruptcies.

df['dti'].describe().reset_index().hvplot.table(title='DTI Table Description', height=250)
# Can DTI be 999?
dti = df.hvplot.hist(
    y='dti', bins=50, width=450, height=350, 
    title="dti Distribution", xlabel='dti', ylabel='Count'
).opts(yformatter='%d')
dti_sub = df.loc[df['dti'] < 100].hvplot.hist(
    y='dti', bins=50, width=450, height=350, 
    title="dti(<100) Distribution", xlabel='dti', ylabel='Count', shared_axes=False
).opts(yformatter='%d')

dti_sub2 = df.loc[df['dti'] > 40].hvplot.hist(
    y='dti', bins=100, width=450, height=350, 
    title="dti(>40) Distribution", xlabel='dti', ylabel='Count', shared_axes=False
).opts(yformatter='%d')

dti + dti_sub + dti_sub2
dti = df[df['dti']<=50].hvplot.hist(
    y='dti', by='loan_status', bins=50, width=450, height=350, 
    title="dti (<=50) Distribution", xlabel='dti', ylabel='Count', 
    alpha=0.3, legend='top'
).opts(yformatter='%d')

open_acc = df.hvplot.hist(
    y='open_acc', by='loan_status', bins=50, width=450, height=350, 
    title='Loan Status by The number of open credit lines', xlabel='The number of open credit lines', ylabel='Count', 
    alpha=0.4, legend='top'
).opts(yformatter='%d')

total_acc = df.hvplot.hist(
    y='total_acc', by='loan_status', bins=50, width=450, height=350, 
    title='Loan Status by The total number of credit lines', xlabel='The total number of credit lines', ylabel='Count', 
    alpha=0.4, legend='top'
).opts(yformatter='%d')

dti + open_acc + total_acc

revol_bal & revol_util

What is Revolving balance?

In credit card terms, a revolving balance is the portion of credit card spending that goes unpaid at the end of a billing cycle.

  • revol_bal: Total credit revolving balance.

  • revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.

revol_util = df.hvplot.hist(
    y='revol_util', by='loan_status', bins=50, width=450, height=400, 
    title='Loan Status by Revolving line utilization rate', xlabel='Revolving line utilization rate', ylabel='Count', 
    alpha=0.4, legend='top'
).opts(yformatter='%d')

revol_util_sub = df[df['revol_util'] < 120].hvplot.hist(
    y='revol_util', by='loan_status', bins=50, width=550, height=400, 
    title='Loan Status by Revolving line utilization rate (< 120)', xlabel='Revolving line utilization rate', ylabel='Count', 
    shared_axes=False, alpha=0.4, legend='top'
).opts(yformatter='%d')

revol_util + revol_util_sub